62 ◾ Bioinformatics
2.2.5 FM-Index
FM-index (Full-text Minute-space index) performs a backward search mechanism on the
top of BWT to find exact pattern matches on a string with a linear computational time
complexity O(N), regardless of the length of the string. Moreover, it could achieve high
compression ratios that allow the indexing of the whole human genome into less than 2.0G
of memory storage [2]. An FN-index searching is performed by using LF mapping matrix.
Searching for a read (a substring) with FM-index uses the backward matching in which the
LF mapping is used repeatedly until matches for the substring are found. FM-index intro-
duces the rank table and lookup table. A rank table is generated from the BWT (column L)
using the character’s rank, which is defined as the number of times a character occurs pre-
viously in the prefix L[1…k] (i.e., lists the occurrence and order of each unique character).
The rank table acts as a function B(c, k), which is explained as follows: if a character c and
a prefix of length k are given, then we can infer the number of times that character occurs
in the prefix. A lookup table lists the index of the first occurrence of each character c from
the first column (column F) of the sorted matrix. The first occurrence of the character c in
the lookup table can be described as C[c].
Figure 2.12 shows Burrows–Wheeler matrix (a), the rank table (b), and the lookup table
(c). Using the rank table and lookup table, the LF mapping can be described as
LF(i) = C[L[i]] + B(L[i], i)
The original string can be recovered using the above rule and backward mechanisms are
as follows:
The last character in the string is “$” on the first row in column F, preceded by “A” on
the first row in column K. So, the last two characters are “A$”. Now we can use the above
rule to find the character that comes before “A$”. Since “A” is on row 1 in column L, we can
locate that character in column F as LF(1) as follows:
LF(1) = C(A] + B(A, 1) = 1 + 2 =2 : The character is on row 2 in column F. So, “A” is pre-
ceded by “G”, which is on row 2 in column L. The last three characters are “GA$”.
LF(2) = C(G] + B(G, 2) = 4 + 1 =5 : The character is on row 5 in column F. So, “G” is pre-
ceded by “G”, which is on row 5 in column L. The last four characters are “GGA$”.
LF(5) = C(G] + B(G, 5) = 4 + 3 =7 : The character is on row 7 in column F. So, “G” is pre-
ceded by “T”, which is on row 7 in column L. The last four characters are “TGGA$”.
LF(7) = C(T] + B(T, 7) = 8 + 1 =9 : The character is on row 9 in column F. So, “T” is pre-
ceded by “C”, which is on row 9 in column L. The last four characters are “CTGGA$”.
We can continue using that rule with the rank table and lookup table until we recover
the entire spring “CTTGGCTGGA$”.
The FM-index can also be used as a pattern searching technique that operates on the
BWT to map read sequences to locations on the BW-transformed reference genome.
Searching for a pattern is a backward process like recovering the original string; a
character is processed at a time, beginning with the last character of the pattern (read
sequence) [11].